Constructing specialised corpora through analysing domain representativeness of websites

Authors

  • Wilson Wong
  • Wei Liu
  • Mohammed Bennamoun
Abstract

The role of the Web for text corpus construction is becoming increasingly significant. However, the contribution of the Web is largely confined to building a general virtual corpus or low quality specialised corpora. In this paper, we introduce a new technique called SPARTAN for constructing specialised corpora from the Web by systematically analysing website contents. Our evaluations show that the corpora constructed using our technique are independent of the search engines employed. In particular, SPARTAN-derived corpora outperform all corpora based on existing techniques for the task of term recognition.

Similar resources

A survey of the websites of the general directorates of Iran's public libraries: a webometric study

Purpose: Through analysis of different types of web links, this study aims to evaluate the status of links in the provincial websites of the Iran Public Libraries Foundation. Methodology: Link analysis, a webometric method, was used in the present research. Data were collected using LexiURL software and the Yahoo search engine. The population under study included the Provincial websit...

Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites

In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English–Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manual...

Survey on the Status of Persian-Language Health Services through the Internet

Background: The Internet has transformed the manner of information seeking and has changed users’ approach to information, particularly in the health domain. In this regard, the number of Persian-language health service websites is increasing. Therefore, information about the variety of services they offer is very important. The present study was designed to describe ...

A THESIS PROPOSAL about USING ZIPF FREQUENCIES AS A REPRESENTATIVENESS MEASURE IN STATISTICAL ACTIVE LEARNING OF NATURAL LANGUAGE

Active learning has proven to be a successful strategy for the quick development of corpora to be used in statistical induction of natural language. The vast majority of studies in this field have concentrated on finding and testing various informativeness measures for samples; however, representativeness measures for selected samples have not been thoroughly studied. In this thesis, we intend to intro...

Verbs in specialised corpora: from manual corpus-based description to automatic extraction in an English-French parallel corpus

This paper tackles the issue of verbs in specialised corpora with a view to term extraction. Corpus-based manual descriptions to be used in various applications have highlighted the “deviant” uses of verbs in specialised corpora compared with general uses, as well as the need for verb extraction. However, very little attention has been given to verbs both in terminology theory and automatic ter...

Journal:
  • Language Resources and Evaluation

Volume 45

Publication date: 2011